Recency, Frequency, Monetary (RFM) Modelling is a modelling technique widely used in the marketing industry. The main aim of the model is to segment a customer base into different categories based on their spending habits and potential customer value. Each customer has a recency value (how long ago they last purchased from the company), a frequency value (how many times they have purchased from the company) and a monetary value (how much they have spent). By clustering customers based on these metrics, a company is able to separate customer groups. For example, customers with high monetary and frequency scores could be seen as 'high value, loyal' customers, whereas those with low monetary and recency scores could be seen as 'low-value lost' customers.
Research question 1 aims to see if this type of modelling could be applied to seismic data, to see how versatile this modelling technique is when applied to a different domain, and if it could be used to model earthquake risk effectively. The objective is to segment counties in the USA using earthquake recency, frequency and cumulative magnitude for risk modelling.
The necessary modules and settings needed throughout the project were initialised.
import os #for working directory
import csv #for handling csv data
import pandas as pd #for data manipulation and processing
import numpy as np #for mathematical operations
import datetime as dt #for dealing with temporal features
import seaborn as sns #for visual exploratory analysis
from shapely.geometry import Point #for spatial and geometric operations
import geopandas as gpd #for manipulation and processing of spatial data
import matplotlib.pyplot as plt #for data visualisation
import time #for dealing with temporal data
import plotly
import folium
from folium import plugins
pd.set_option('display.max_columns', None) #displays all columns in dataframs
Some modules specific to plotly are required to run this too. These are shown below.
#plotly specifics
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
init_notebook_mode()
plotly.tools.set_credentials_file(username='rhurley46', api_key='qnAm0NMkkIMnNr1w26H4')
import warnings
warnings.filterwarnings('ignore')
Next, the working directory was set to the location of the downloaded data.
#change working directory
os.chdir(r'C:\Users\aczd067\OD\INM430-PriciplesOfDataScience\Fracking\Data\2.5+')
#create list of files in dirctory
files = os.listdir(r'C:\Users\aczd067\OD\INM430-PriciplesOfDataScience\Fracking\Data\2.5+')
for name in files:
print(name)
The 8 files contain information about individual seismic events over 2.5m in the USA from 1974 to November 2018. The data was combined into a single dataframe.
#read each earthquake csv into dataframes
df1 = pd.read_csv(files[0], parse_dates=['time'], engine = 'python')
df2 = pd.read_csv(files[1], parse_dates=['time'], engine = 'python')
df3 = pd.read_csv(files[2], parse_dates=['time'], engine = 'python')
df4 = pd.read_csv(files[3], parse_dates=['time'], engine = 'python')
df5 = pd.read_csv(files[4], parse_dates=['time'], engine = 'python')
df6 = pd.read_csv(files[5], parse_dates=['time'], engine = 'python')
df7 = pd.read_csv(files[6], parse_dates=['time'], engine = 'python')
df8 = pd.read_csv(files[7], parse_dates=['time'], engine = 'python')
#create list of dataframes
alldfs = [df1, df2, df3, df4, df5, df6, df7, df8]
#concatenate individual dataframes into one
df = pd.concat(alldfs)
Overall data summaries are shown below, and then the first 5 observations.
df.info()
df.describe()
df.head()
From the summary, there are 116950 observations, each an individual earthquake. The data is mostly complete, although some features are missing lots of data. For example, 'dmin' with only 73,547 records, as well as 'horizontalError', 'depthError' and 'magError', which are missing a lot of data points too. This will need to be considered and dealt with appropriately during analysis.
In terms of outliers, at this stage there are some indications of outliers in the data. This can be seen in instances, for example, where the mean of 'horizontalError' is 1.690917, and the max is 99.000000. This too indicates that there are outliers in the data that will need to be dealt with appropriately.
We can remove some features as that are not relevant. However, we must first understand what each column means. This metadata is taken from the US Geological Survey website.
'time' - The time (UTC) the earthquake occurred.
'latitude' - Decimal latitude of the eathquake.
'longitude' - Decimal longitude of the earthquake.
'depth' - The depth of the earthquake below Earth surface.
'mag' - The magnitude of the earthquake on the richter scale.
'magType' - The method or algorithm used to calculate the preferred magnitude for the event.
'nst' - The total number of seismic stations used to determine earthquake location.
'gap' - The largest azimuthal gap between azimuthally adjacent stations (in degrees). In general, the smaller this number, the more reliable is the calculated horizontal position of the earthquake. Earthquake locations in which the azimuthal gap exceeds 180 degrees typically have large location and depth uncertainties.
'dmin' - Horizontal distance from the epicenter to the nearest station (in degrees). 1 degree is approximately 111.2 kilometers. In general, the smaller this number, the more reliable is the calculated depth of the earthquake.
'rms' - This parameter provides a measure of the fit of the observed arrival times to the predicted arrival times for this location. Smaller numbers reflect a better fit of the data.
'net' - The ID of a data contributor. Identifies the network considered to be the preferred source of information for this event.
id - A unique identifier for the event.
updated - Time when the event was most recently updated.
place - Textual description of named geographic region near to the event. This may be a city name, or a Flinn-Engdahl Region name.
type - Type of seismic event. eg. “earthquake”, “quarry”.
horizontalError - Uncertainty of reported location of the event in kilometers.
depthError - Uncertainty of reported depth of the event in kilometers.
magError - Uncertainty of reported magnitude of the event. The estimated standard error of the magnitude.
magNst - The total number of seismic stations used to calculate the magnitude for this earthquake.
status - Indicates whether the event has been reviewed by a human.
locationSource - The network that originally authored the reported location of this event.
magSource - Network that originally authored the reported magnitude for this event.
Columns that added no value to the analysis were removed.
These were; 'magType', 'net', 'id', 'updated', 'locationSource', 'magSource'.
df2 = df
df2 = df2.drop(['magType', 'net', 'id', 'updated', 'locationSource', 'magSource'], axis=1)
print("Remaining column names: \n \n", df2.columns.values)
Next, any events where 'type' is not 'earthquake' were removed. For example, mining explosions, sonic booms. Only the earthquakes that have been reviewed were kept.
print("Possible earthquake types: \n\n", df2['type'].unique())
df3 = df2[df2['type'] == 'earthquake']
df3 = df3[df3['status'] == 'reviewed']
print("\n Remaing observations after removal: ", len(df3))
Outliers in the data were then removed.
It is important to consider that sometimes, and in this project, outliers may be what we are looking for. However, they often can indicate poor quality data.
To ensure that the earthquake data is credible, any outliers in the measurement error columns were deleted, which indicated poor quality data.
Some measurements were missing, so the missing values were imputed with the median. The median was chosen instead of the mean in this case because outliers which are yet to be removed may affect the mean.
#impute median to measurement error columns
df4 = df3.fillna(df3.median())
After dealing with missing data, any outliers over 2 standard deviation from the mean from the measurement columns were removed.
#otlier removal 3std
#remove any measurement errors that are over +3 standard deviations
df5 = df4[df4['horizontalError']-df4['horizontalError'].mean() < (2*df4['horizontalError'].std())]
df5 = df4[df4['depthError']-df4['depthError'].mean() < (2*df4['depthError'].std())]
df5 = df4[df4['magError']-df4['magError'].mean() < (2*df4['magError'].std())]
print("\n Remaing observations after outlier removal: ", len(df5))
Deletion of redundant data, filtering of earthquake type, imputation of missing values and removal of erroneous measurement through outlier detection had been carried out.
The earthquake dataset was the usable and credible.
The timescale of the data is shown below.
#convert 'time' to datetime format
df5['time'] = pd.to_datetime(df5['time'])
#format datetime
df5['time'] = pd.to_datetime(df5['time'], format= '%Y:%m:%d %H:%M:%S')
#print first date
print("First Earthquake: ", df5['time'].min())
#print last data
print("Last Earthquake: ", df5['time'].max())
The data was from 30th November 1975 until 1st November 2018. This was trimmed down so there are only earthquakes from the start of 1975 to the end of 2017.
#trim date range of data
quakes = df5[(df5['time'].dt.year >= 1975) & (df5['time'].dt.year <= 2017)]
#order dataframe by time
quakes = quakes.sort_values(by='time', ascending=True)
print("First Earthquake: ", quakes['time'].min())
print("Last Earthquake: ", quakes['time'].max())
print("\n Number of observations: ", len(quakes))
After this, there was a trimmed, ordered dataframe containing 104,130 recorded earthquakes between 1975 and 2017, inclusive.
The dataframe was converted to a spatial dataframe which was needed for conversion to geodataframe and subsequent spatial queries and joins. A new column, 'coordinates' was derived from existing 'latitude' and 'longitude' columns.
#create coodinates column from lat long columns
quakes['coordinates'] = list(zip(quakes.longitude, quakes.latitude))
These were used as geometry for spatial operations after applying shapely's 'Point' method.
#convert coordinate to spatial point geometry
quakes['coordinates'] = quakes['coordinates'].apply(Point)
The dataframe was transformed into a spatial dataframe with geopandas.
#define geodataframe
quakes_gdf = gpd.GeoDataFrame(quakes, geometry = 'coordinates')
#set projection system
quakes_gdf.crs = {'init':'epsg:4326'}
This was then a spatial dataframe of the earthquake data, meaning that the data was able to be plotted on maps and have spatial operations performed on it.
The earthquake data was a rich dataset. However, one thing that it did lack was more detail about the location of the earthquakes. Despite latitude and longitude being present, the 'place' feature was vague at times, and not consistent enough for analysis. It would be ideal to have the name of the US state and state county of each earthquake.
These features were derived another way. This was done through a 'spatial join'. External shapefiles were be imported, and a spatial join was carried out on the existing 'quakes_gdf' data, and the shapefile to extract the required information.
#change working directory
os.chdir(r'C:\Users\aczd067\OD\INM430-PriciplesOfDataScience\Fracking\Data\Geospatial\USA_shps')
#load in counties shapefile
usa_counties_shp = gpd.read_file('gz_2010_us_050_00_5m.shp')
#set coordinate system
usa_counties_shp.crs = {'init':'epsg:4326'}
#load in states shapefile
usa_states_shp = gpd.read_file('cb_2017_us_state_5m.shp')
#set coordinate system
usa_states_shp.crs = {'init':'epsg:4326'}
Each county could be uniquely identified by concatenating the 'STATE' and 'COUNTY' codes.
#create unique county id
usa_counties_shp['county_id'] = usa_counties_shp['STATE'] + usa_counties_shp['COUNTY']
#view counties geodataframe
usa_counties_shp.head()
#view states geodataframe
usa_states_shp.head()
Geopandas spatial join was used function to join 'quakes_gdf' and 'usa_counties_shp' where they intersect, and then repetat with 'usa_states_shp'.
#join earthquake to county shapefile
quakes_sj = gpd.sjoin(quakes_gdf, usa_counties_shp, how = 'inner', op='within')
#drop unnecessary columns that were added
quakes_sj.drop(['index_right', 'GEO_ID', 'LSAD', 'CENSUSAREA'], axis = 1, inplace = True)
#join again to states shapefile
quakes_sj = gpd.sjoin(quakes_sj, usa_states_shp, how = 'inner', op='intersects')
#drop unnecessary columns that were added
quakes_sj.drop(['STATEFP', 'STATENS', 'AFFGEOID', 'GEOID','LSAD', 'ALAND', 'AWATER', 'index_right', 'COUNTY'], axis = 1, inplace = True)
print("\n Remaing observations after spatial join removal: ", len(quakes_sj))
The number of observations were reduced. This was likely due to earthquakes appearing in Canada, Mexixo, or at sea.
Finally, some of the columns were renamed to make use more intuitive.
quakes_sj = quakes_sj.rename(columns={"STATE": "state_no",
"NAME_left": "county",
"STUSPS": "state_abb",
"NAME_right":"state"})
quakes_sj = quakes_sj.sort_values(by='time', ascending=True)
Regional information was still missing which may be helpful. The USA is vaguely split into 5 geographic areas; West, Southwest, Midwest, Southeast and Northeast. Using data from https://www.ducksters.com/geography/us_states/us_geographical_regions.php, these were made into a dictionary and mapped to a new column; 'region'.
#state dictionary (https://www.ducksters.com/geography/us_states/us_geographical_regions.php)
statedic = {'California':'West','Washington':'West','Arkansas':'Southeast','Missouri':'Midwest','Montana':'West',
'Colorado':'West', 'Nevada':'West','Kentucky':'Southeast','Ohio':'Midwest','Alabama':'Southeast',
'Wyoming':'West','Idaho':'West','Utah':'West','Arizona':'Southwest','Tennessee':'Southeast',
'Oregon':'West','Nebraska':'Midwest','New York':'Northeast','Minnesota':'Midwest','Texas':'Southwest',
'Mississippi':'Southeast','Oklahoma':'Southwest','Virginia':'Southeast','South Carolina':'Southeast',
'New Mexico':'Southwest','West Virginia':'Southeast','Rhode Island':'Northeast','Massachusetts':'Northeast',
'New Jersey':'Northeast','Georgia':'Southeast','Illinois':'Midwest','New Hampshire':'Northeast',
'Maine':'Northeast','North Carolina':'Southeast','Maryland':'Northeast', 'Pennsylvania':'Northeast',
'Vermont':'Northeast','Connecticut':'Northeast','Kansas':'Midwest','South Dakota':'Midwest',
'North Dakota':'Midwest','Louisiana':'Southeast','Indiana':'Midwest','Michigan':'Midwest','Iowa':'Midwest',
'Delaware':'Northeast'}
#map regions to quakes
quakes_sj['region'] = quakes_sj['state'].map(statedic)
quakes_sj.head()
This was them a preprocesed, spatially accurate point geodataframe, with derived features related to county and state location. The data was useable, credible and useful. Further analysis could then be performed.
For the RFM modelling and k-means clustering to be done, 'Recency', 'Frequency' and 'Magnitude' columns were derived for each county that had experienced an earthquake since 1975. 'Recency' was how many months ago the last earthquake occured. 'Frequency' was how many earthquakes had happened in this time, and 'Magnitude' was the cumulative magnitude since 1975 for each county.
#rfm table initialisation
quakes_rfm = quakes_sj[['time', 'county_id', 'county', 'state', 'mag']]
quakes_rfm.head()
To derive 'Recency', how many month ago the earthquake was.
enddate = dt.datetime(2018, 1, 1)
quakes_rfm['recency'] = quakes_rfm['time'].apply(lambda x: (enddate.year - x.year) * 12 + enddate.month - x.month)
quakes_rfm.head()
Frequency was how many earthquakes had occured.
quakes_frequency = pd.DataFrame(quakes_rfm['county_id'].value_counts(), index = None)
quakes_frequency = quakes_frequency.reset_index(drop=False)
quakes_frequency.columns = ['county_id', 'frequency']
quakes_frequency.head()
quakes_rfm2 = pd.merge(quakes_rfm, quakes_frequency, how = 'inner', on = 'county_id')
quakes_rfm2.head()
Magnitude, was the cumulative magnitude since 1975.
quakes_rfm_table = quakes_rfm2.groupby('county_id').agg({'mag':'sum', 'recency':'min', 'frequency':'min'})
quakes_rfm_table = quakes_rfm_table.reset_index(drop=False)
quakes_rfm_table.head()
quakes_RFM = quakes_rfm_table.merge(quakes_rfm[['county_id', 'county', 'state']], on = 'county_id', how = 'right')
quakes_RFM = quakes_RFM.drop_duplicates(keep='first')
quakes_RFM = quakes_RFM.set_index('county_id')
quakes_RFM.head()
These metrics for each county could then be visualised on the plot below.
trace0 = Scatter3d(x=quakes_RFM.recency,
y=quakes_RFM.frequency,
z=quakes_RFM.mag,
mode='markers',
marker=dict(size=4, # set color to an array/list of desired values
colorscale='Viridis', # choose a colorscale
opacity=0.8),
text= (quakes_RFM['county']+", "+quakes_RFM['state'])
)
layout0 = Layout(title = 'Earthquake RFM Model',
scene = dict(xaxis = dict(title='Recency'),
yaxis = dict(title='Frequency'),
zaxis = dict(title='Magnitude'),
camera = dict( up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.78, y=-1.84, z=0.34))),
margin=dict(r=0, l=0, b=0, t=40)
)
data = [trace0]
fig4 = dict(data=data, layout=layout0)
iplot(fig4)
Clearly, the data was very skewed. This was because of the large variation between scales of the 3 metrics. To resolve this issue, the data was transformed using log normalisation and then visualised again.
#scale by log to standardise
quakes_RFM_log = quakes_RFM.applymap(lambda x: np.log(x) if isinstance(x, int) else (np.log(x) if isinstance(x, float) else x))
quakes_RFM_log.head()
trace0 = Scatter3d(x=quakes_RFM_log.recency,
y=quakes_RFM_log.frequency,
z=quakes_RFM_log.mag,
mode='markers',
marker=dict(size=4, # set color to an array/list of desired values
colorscale='Viridis', # choose a colorscale
opacity=0.8),
text= (quakes_RFM['county']+", "+quakes_RFM['state'])
)
layout0 = Layout(title = 'Earthquake RFM Model (Standardised)',
scene = dict(xaxis = dict(title='Recency'),
yaxis = dict(title='Frequency'),
zaxis = dict(title='Magnitude'),
camera = dict( up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.78, y=-1.84, z=0.34))),
margin=dict(r=0, l=0, b=0, t=40)
)
data = [trace0]
fig5 = dict(data=data, layout=layout0)
iplot(fig5)
The result of this meant that there was a much clearer and despersed set of data points. K-means clustering was then able to be performed.
K-means clustering is an unsupervised learning algorithm used to partition a dataset into a defined number of groups. It was chosed to be used for this project as it would partition the US counties in the RFM model into different groups with similar recency, frequency and monetary characteristics, which would then be used to classify counties according to seismic event risk.
The k-means algorithm was initialised with 'k-means ++' selected as a hyperparameter. This selects initial cluster centers for k-mean clustering in an optimised way to speed up convergence.
The model was run 15 times. Each time the within cluser sum of squares (WCSS) metric was calculated. The aim was to minimised this whilts using an optimim number of clusters. The elbow method was used to select the number of clusters following this.
#kmeans clustering
from sklearn.cluster import KMeans
#elbow graph
wcss = []
for i in range(1, 15):
kmeans = KMeans(n_clusters= i, init = 'k-means++', random_state=42)
kmeans.fit(quakes_RFM_log.iloc[:, 0:2])
wcss.append(kmeans.inertia_)
#plot WCSS elbow graph
trace0 = Scatter(x = np.array(range(1,15)), y = wcss, mode = "lines", name = "WCSS",
marker = dict(color = 'rgba(0, 0, 255, 1)'), text= wcss)
data = [trace0]
layout = dict(title = 'Elbow Graph',
xaxis= dict(title= 'Number of Clusters',
ticklen= 5,
zeroline= True),
yaxis = dict(title= 'WCSS'),
annotations=[dict(x=4, y=1319,
xref='x', yref='y', text='WCSS Plateau',
showarrow=True, arrowhead=7, ax=0, ay=-240)])
fig6 = dict(data = data, layout = layout)
iplot(fig6)
The elbow method indicates that the optimum number of cluseters is around 4. The kmeans algorithm was run again with this as the number of clusters. The cluste information was then assessed.
kmeans = KMeans(n_clusters= 4, init = 'k-means++', random_state=42)
clusters = kmeans.fit_predict(quakes_RFM_log.iloc[:, 0:2])
quakes_RFM_log['cluster'] = clusters
RFM_cluster0 = quakes_RFM_log[quakes_RFM_log['cluster'] == 0]
RFM_cluster1 = quakes_RFM_log[quakes_RFM_log['cluster'] == 1]
RFM_cluster2 = quakes_RFM_log[quakes_RFM_log['cluster'] == 2]
RFM_cluster3 = quakes_RFM_log[quakes_RFM_log['cluster'] == 3]
cluster_groups = [RFM_cluster0, RFM_cluster1, RFM_cluster2, RFM_cluster3]
for cluster in cluster_groups:
print("Cluster: ",cluster.cluster.min())
print(cluster.describe())
print(" ")
We can see from the cluster metrics that 'Cluster 1' has the lowest average recency, but highest average frequency and magnitudes. The counties with in this cluster are likely the 'Very High Risk' counties.
'Cluster 3' has the highest average recency, and lowest mean frequency and magnitude scores. This indicated seismic activity usually last happened longer ago than in other counties, and very infrequently, perhaps just 'one-off' events. These are 'Low Risk' Counties.
Of the remaining two clusters, 'Cluster 0' has lower average magnitude and frequency, but lower recency to. These are counties that have had activity more recently in cases, but not very active in general- 'Moderate Risk'. For 'Cluster 2', although the average recency is slightly higher that of 'Cluster 3', it still has had higher average magnitude and frequency scores, showing more seismic activity. These are therefore 'High Risk' counties.
Counties were assigned these new labels.
#assign new label risk names
risk = {0:'Moderate Risk', 1: 'Very High Risk', 2:'High Risk', 3:'Low Risk'}
#map new names to dataframe
quakes_RFM_log['Risk Factor'] = quakes_RFM_log['cluster'].map(risk)
quakes_RFM_log.tail()
quakes_RFM_log.head()
After assingning the clusters the new labels, the final results could agaom be visualised on a graph.
RFM_cluster_VH = quakes_RFM_log[quakes_RFM_log['Risk Factor'] == 'Very High Risk']
RFM_cluster_H = quakes_RFM_log[quakes_RFM_log['Risk Factor'] == 'High Risk']
RFM_cluster_M = quakes_RFM_log[quakes_RFM_log['Risk Factor'] == 'Moderate Risk']
RFM_cluster_L = quakes_RFM_log[quakes_RFM_log['Risk Factor'] == 'Low Risk']
trace1 = Scatter3d(x=RFM_cluster_VH.recency,
y=RFM_cluster_VH.frequency,
z=RFM_cluster_VH.mag,
mode='markers',
name = 'Very High Risk',
marker=dict(size=4, # set color to an array/list of desired values
colorscale='Viridis', # choose a colorscale
opacity=0.8,
color = 'rgba(255, 0, 0, 0.6)'),
text= (RFM_cluster_VH['county']+", "+RFM_cluster_VH['state'])
)
trace2 = Scatter3d(x=RFM_cluster_H.recency,
y=RFM_cluster_H.frequency,
z=RFM_cluster_H.mag,
mode='markers',
name = 'High Risk',
marker=dict(size=4, # set color to an array/list of desired values
colorscale='Viridis', # choose a colorscale
opacity=0.8,
color = 'rgba(255, 128, 0, 0.6)'),
text= (RFM_cluster_H['county']+", "+RFM_cluster_H['state'])
)
trace3 = Scatter3d(x=RFM_cluster_M.recency,
y=RFM_cluster_M.frequency,
z=RFM_cluster_M.mag,
mode='markers',
name = 'Moderate Risk',
marker=dict(size=4, # set color to an array/list of desired values
colorscale='Viridis', # choose a colorscale
opacity=0.8,
color = 'rgba(204, 204, 0, 0.6)'),
text= (RFM_cluster_M['county']+", "+RFM_cluster_M['state'])
)
trace4 = Scatter3d(x=RFM_cluster_L.recency,
y=RFM_cluster_L.frequency,
z=RFM_cluster_L.mag,
mode='markers',
name = 'Low Risk',
marker=dict(size=4, # set color to an array/list of desired values
colorscale='Viridis', # choose a colorscale
opacity=0.8,
color = 'rgba(0, 128, 0, 0.6)'),
text= (RFM_cluster_L['county']+", "+RFM_cluster_L['state'])
)
layout0 = Layout(title = 'Earthquake RFM Model (Standardised)',
scene = dict(xaxis = dict(title='Recency'),
yaxis = dict(title='Frequency'),
zaxis = dict(title='Magnitude'),
camera = dict( up=dict(x=0, y=0, z=1),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.78, y=-1.84, z=0.34))),
margin=dict(r=0, l=0, b=0, t=40)
)
data = [trace1, trace2, trace3, trace4]
fig7 = dict(data=data, layout=layout0)
iplot(fig7)
The k-means clustering results of the earthquake RFM data can be seen in the figure above. We can very clearly see the differences between different risk categoried. Hovering over points on the graph allows information on the counties to be viewed. Lots of the Very High Risk counties are are around California and some other states.
Model evaluation and performance could then be carried out.
To evaluate the model, earthquake data from the first ten months of 2018. Using the classification outputs from the k-means clustering, the propotion of earthquakes occuring in each of the risk classes was calulated.
It was decided that for the model to be seen as useful; 80% of 2018 earthquakes should occur in 'Very High Risk' counties, 10% should occur in 'High Risk' counties, and the rest should be evenly split between the other two categories.
The earthquakes in 2018 were isolated and previous spatial operations performed again.
#using initial quakes_sj
#trim date range of data
quakes18 = df5[(df5['time'].dt.year >= 2018) & (df5['time'].dt.month <= 10)]
#order dataframe by time
quakes18 = quakes18.sort_values(by='time', ascending=True)
#print first date
print("First Earthquake: ", quakes18['time'].min())
#print last data
print("Last Earthquake: ", quakes18['time'].max())
quakes18['coordinates'] = list(zip(quakes18.longitude, quakes18.latitude))
#convert to spatial points
quakes18['coordinates'] = quakes18['coordinates'].apply(Point)
#define geodataframe
quakes18_gdf = gpd.GeoDataFrame(quakes18, geometry = 'coordinates')
#set projection system
quakes18_gdf.crs = {'init':'epsg:4326'}
#join earthquake to county shapefile
quakes18_sj = gpd.sjoin(quakes18_gdf, usa_counties_shp, how = 'inner', op='within')
#drop unnecessary columns that were added
quakes18_sj.drop(['index_right', 'GEO_ID', 'LSAD', 'CENSUSAREA'], axis = 1, inplace = True)
usa_counties_cluster = pd.merge(usa_counties_shp, quakes_RFM_log, left_on= 'county_id', right_index=True, how = 'inner')
#usa_counties_cluster.to_file('usa_counties_cluster_2.shp', driver='ESRI Shapefile')
The counties from the RMF model were separated into different dataframes.
usa_counties_cluster_HHR = usa_counties_cluster[usa_counties_cluster['Risk Factor'] == "Very High Risk"]
usa_counties_cluster_HR = usa_counties_cluster[usa_counties_cluster['Risk Factor'] == "High Risk"]
usa_counties_cluster_MR = usa_counties_cluster[usa_counties_cluster['Risk Factor'] == "Moderate Risk"]
usa_counties_cluster_LR = usa_counties_cluster[usa_counties_cluster['Risk Factor'] == "Low Risk"]
quakes18_sj_clusts = gpd.sjoin(quakes18_sj, usa_counties_cluster, how = 'inner', op='within')
quakes18_sj_clusts.head()
The total number of earthquakes within the USA county boundaries are shown below.
#all earthquakes WITHIN usa county boundaries
quakes_sj_all = gpd.sjoin(quakes18_sj, usa_counties_shp, how = 'inner', op='within')
len(quakes_sj_all)
Then, the number if these earthquakes occuring in each of the risk categories was calculated.
print("Total number of 2018 earthquakes: ", len(quakes_sj_all))
print("Number of 2018 earthquakes in 'Very High Risk' counties: ", quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'Very High Risk'].time.count())
print("Number of 2018 earthquakes in 'High Risk' counties: ", quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'High Risk'].time.count())
print("Number of 2018 earthquakes in 'Moderate Risk' counties: ", quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'Moderate Risk'].time.count())
print("Number of 2018 earthquakes in 'Low Risk' counties: ", quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'Low Risk'].time.count())
print("Number of 2018 earthquakes in new counties: ", ((len(quakes_sj_all)) - (quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'Very High Risk'].time.count()) - (quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'High Risk'].time.count()) - (quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'Moderate Risk'].time.count()) - (quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'Low Risk'].time.count())))
The proportional breakdown can be viewed below.
fig31 = {"data": [{"values": [quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'Very High Risk'].time.count(),
quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'High Risk'].time.count(),
quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'Moderate Risk'].time.count(),
quakes18_sj_clusts[quakes18_sj_clusts['Risk Factor'] == 'Low Risk'].time.count(),
15],
"labels": ["Very High Risk", "High Risk", "Moderate Risk", "Low Risk","New Counties"],
"domain": {"x": [0, .48]},
"name": "2018 Earthquake RFM Model Assessment",
"hoverinfo":"label+percent+name",
"hole": .4,
"type": "pie",
'marker': {'colors': ['rgb(255, 0, 0)', 'rgb(255, 128, 0)', 'rgb(204, 204, 0)', 'rgb(0, 128, 0)','rgb(211,211,211)']},
}
],
"layout": {"title":"2018 Earthquake RFM Model Assessment",
}
}
iplot(fig31)
These results show that the model performed well with regards to the evaluation requirements at the start of 2.1.2.4. It is shown that the model exceded expectations, as 84.8% of the earthquakes occured in 'Very High Risk' counties. The 'High Risk' county requirement was not met as this was 7.88% rater than 10% - however, this is attributable to the stronger than expected performance of the 'Very High Risk'. The remaining categories performed well as desired. 1.05% of earthquakes occurred in unclassified counties. This is a slight limitation in the model, and suggests that any counties that are not classifies after RFM clustering should be assigned to 'Low Risk' counties.
Following this analysis, it was deemed useful to assess the spatial distribution of these counties to see if this gave us any more information about spatial distribution of the RFM clustering.
A chloropleth map shows the location and 'Risk Factor' of each of the counties below.
n = folium.Map([39.039809, -95.493028 ], zoom_start=4, tiles='cartodbpositron')
folium.Choropleth(usa_counties_cluster_HHR, name = 'Very High Risk', fill_color='Red',line_color='Red',highlight=True).add_to(n)
folium.Choropleth(usa_counties_cluster_HR, name = 'High Risk', fill_color='Orange',line_color='Orange',highlight=True).add_to(n)
folium.Choropleth(usa_counties_cluster_MR, name = 'Moderate Risk', fill_color='Yellow',line_color='Yellow',highlight=True).add_to(n)
folium.Choropleth(usa_counties_cluster_LR, name = 'Low Risk', fill_color='Green',line_color='Green',highlight=True).add_to(n)
# add the layer control
folium.LayerControl(collapsed=False).add_to(n)
n
The map shows a few things. 'Very High Risk' counties are mostly around the West of the USA. Interestingly, there are small pockets of VHR counties around Oklahma and some near Tenessee and Arkansas. The VHR counties are generally near HR counties, with Moderate and Low risk counties dispersed more widely.
Time series analysis was then carried out to investigate these patterns further.
A graph showing the annual number of earthquakes per year since 1975 was plotted.
#new dataframe for annual
annual_quakes = quakes_sj
#convert 'time' to datetime
annual_quakes['time'] = pd.to_datetime(annual_quakes['time'], errors='coerce', unit='s')
#set index to time for time series
annual_quakes = annual_quakes.set_index('time')
yearly_quakes = annual_quakes.resample('Y').count()
#plot annual earthquakes
trace0 = Scatter(x = yearly_quakes.index,
y = yearly_quakes.latitude,
mode = "lines",
name = "earthquakes",
marker = dict(color = 'rgba(0, 0, 0, 1)'),
text= yearly_quakes.index.year)
data = [trace0]
layout = dict(title = 'USA Annual Earthquake Count (1975 - 2017)',
xaxis= dict(title= 'Year',
ticklen= 5,
zeroline= False),
yaxis= dict(title= 'Earthquakes'),
annotations=[dict(x=pd.to_datetime('1993-01-01 00:00:00.00'), y=6989,
xref='x', yref='y', text='California 1992 Earthquakes',
showarrow=True, arrowhead=7, ax=150, ay=0)]
)
fig0 = dict(data = data, layout = layout)
iplot(fig0)
It appears that the annual count of earthquakes across the USA is rather volatile and doesn't show much of a pattern or trend. There is a large peak in 1992. This is attributable to a spate of earthquakes in California that year. Recemy trends show that there has been a rise and quick fall since 2012.
However, the USA is a large country, and it would be interesting to see if these was consistent across all states.
quakes_ann_west = quakes_sj[quakes_sj['region'] == 'West'].set_index('time').resample('Y').count()
quakes_ann_midwest = quakes_sj[quakes_sj['region'] == 'Midwest'].set_index('time').resample('Y').count()
quakes_ann_northeast = quakes_sj[quakes_sj['region'] == 'Northeast'].set_index('time').resample('Y').count()
quakes_ann_southeast = quakes_sj[quakes_sj['region'] == 'Southeast'].set_index('time').resample('Y').count()
quakes_ann_southwest = quakes_sj[quakes_sj['region'] == 'Southwest'].set_index('time').resample('Y').count()
#plot annual earthquakes
trace0 = Scatter(x = yearly_quakes.index, y = yearly_quakes.latitude, mode = "lines", name = "USA Total",
marker = dict(color = 'rgba(0, 0, 0, 0.2)'), line = dict(dash = 'dash'), text= yearly_quakes.index.year)
trace1 = Scatter(x = quakes_ann_west.index,y = quakes_ann_west.latitude,mode = "lines",name = "West",
marker = dict(color = 'rgba(255, 0, 0, 1)'),text= quakes_ann_west.index.year)
trace2 = Scatter(x = quakes_ann_midwest.index,y = quakes_ann_midwest.latitude,mode = "lines",name = "Midwest",
marker = dict(color = 'rgba(0, 128, 0, 1)'), text= quakes_ann_midwest.index.year)
trace3 = Scatter(x = quakes_ann_northeast.index,y = quakes_ann_northeast.latitude, mode = "lines",name = "Northeast",
marker = dict(color = 'rgba(255, 128, 0, 1)'),text= quakes_ann_northeast.index.year)
trace4 = Scatter(x = quakes_ann_southeast.index,y = quakes_ann_southeast.latitude, mode = "lines",name = "Southeast",
marker = dict(color = 'rgba(0, 0, 225, 1)'),text= quakes_ann_southeast.index.year)
trace5 = Scatter(x = quakes_ann_southwest.index, y = quakes_ann_southwest.latitude, mode = "lines",name = "Southwest",
marker = dict(color = 'rgba(204, 204, 0, 1)'),text= quakes_ann_southwest.index.year)
data = [trace0, trace1, trace2, trace3, trace4, trace5]
layout = dict(title = 'USA Annual Earthquake Count by Region (1975 - 2017)',
xaxis= dict(title= 'Year',
ticklen= 5,
zeroline= False),
yaxis= dict(title= 'Earthquakes'),
annotations=[dict(x=pd.to_datetime('2014-01-01 00:00:00.00'), y=330,
xref='x', yref='y', text='Sharp rise in<br>Southwest earthquakes',
showarrow=True, arrowhead=7, ax=0, ay=-240)]
)
fig2 = dict(data = data, layout = layout)
iplot(fig2)
We can now clearly see that there is significant regional variation in the amount of earthquakes across the USA. It appears that the main driver of earthquake counts is the events in the West, as this mimics the pattern of the USA. However, what is interesting is that earthquakes in other regions change patterns after 2010, especially in the Southwest, where there is a sharp rise, to 2,880 earthquakes recently.
A closer look was taken to see what was going on in each state in the Southwest.
#isolate earthquakes in southwest only
quakes_sj_southwest = quakes_sj[quakes_sj['region'] == 'Southwest']
quakes_sj_southwest['state'].unique()
quakes_ann_arizona = quakes_sj_southwest[quakes_sj_southwest['state'] == 'Arizona'].set_index('time').resample('Y').count()
quakes_ann_texas = quakes_sj_southwest[quakes_sj_southwest['state'] == 'Texas'].set_index('time').resample('Y').count()
quakes_ann_oklahoma = quakes_sj_southwest[quakes_sj_southwest['state'] == 'Oklahoma'].set_index('time').resample('Y').count()
quakes_ann_newmexico = quakes_sj_southwest[quakes_sj_southwest['state'] == 'New Mexico'].set_index('time').resample('Y').count()
#plot annual earthquakes
trace0 = Scatter(x = quakes_ann_arizona.index, y = quakes_ann_arizona.latitude,mode = "lines",name = "Arizona",
marker = dict(color = 'rgba(102, 255, 255, 0.4)'), text= quakes_ann_arizona.index.year)
trace1 = Scatter(x = quakes_ann_texas.index, y = quakes_ann_texas.latitude, mode = "lines", name = "Texas",
marker = dict(color = 'rgba(255, 0, 0, 1)'), text= quakes_ann_texas.index.year)
trace2 = Scatter(x = quakes_ann_oklahoma.index, y = quakes_ann_oklahoma.latitude, mode = "lines", name = "Oklahoma",
marker = dict(color = 'rgba(0, 128, 0, 1)'), text= quakes_ann_oklahoma.index.year)
trace3 = Scatter(x = quakes_ann_newmexico.index, y = quakes_ann_newmexico.latitude, mode = "lines", name = "New Mexico",
marker = dict(color = 'rgba(255, 128, 0, 1)'), text= quakes_ann_newmexico.index.year)
data = [trace0, trace1, trace2, trace3]
layout = dict(title = 'Southeast USA Annual Earthquake Count by State (1975 - 2017)',
xaxis= dict(title= 'Year',
ticklen= 5,
zeroline= False))
fig3 = dict(data = data, layout = layout)
iplot(fig3)
It is clear from the figure above, that the main driver of earthquakes in the Southeast states is Oklahoma. This, is abnormal not only because Oklahoma is not near any major fault lines, but also because of the sharp rise in the amount of earthquakes since 2010.
The spatio-temporal patterns are not best visualised using static graphs. The animations below show the patterns of earthquake activity from 1975 to 2017.
#Extract the month column from pickup datetime variable and take subset of data
quakes_sj['year'] = quakes_sj['time'].dt.year
heat_df = quakes_sj[['latitude', 'longitude','year']]
# Ensure you're handing it floats
heat_df['latitude'] = heat_df['latitude'].astype(float)
heat_df['longitude'] = heat_df['longitude'].astype(float)
heat_df
# Create weight column, using date
heat_df['Weight'] = heat_df['year']
heat_df['Weight'] = heat_df['Weight'].astype(float)
heat_df = heat_df.dropna(axis=0, subset=['latitude','longitude', 'Weight'])
quakes_on_heatmap = folium.Map([39.039809, -95.493028 ], zoom_start=4, tiles='cartodbpositron')
# List comprehension to make out list of lists
heat_data = [[[row['latitude'],row['longitude']]
for index, row in heat_df[heat_df['Weight'] == i].iterrows()]
for i in range(1975,2017)]
# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data,auto_play=True, max_opacity=1, radius=5)
folium.CircleMarker(location=[36, -97.5],radius=35,fill=True,
color = 'black',fill_opacity = 0.1, popup=folium.Popup('Oklahoma High Activity from 2010')).add_to(quakes_on_heatmap)
hm.add_to(quakes_on_heatmap)
quakes_on_heatmap
From the animation, it can be observed that the patterns are genrally consistent year on year. Most occur in the West side of the country. The anomoly of Oklahoma has been circled. The animation further reinforces the evidence of high amounts of irregular seismic activity in the past few years.
A closer look is shown below.
quakes_on_heatmap = folium.Map([35.5078, -98.0929], zoom_start=7, tiles='cartodbpositron')
# List comprehension to make out list of lists
heat_data = [[[row['latitude'],row['longitude']]
for index, row in heat_df[heat_df['Weight'] == i].iterrows()]
for i in range(1975,2017)]
# Plot it on the map
hm = plugins.HeatMapWithTime(heat_data,auto_play=True,
max_opacity=0.8,
radius=10)
hm.add_to(quakes_on_heatmap)
quakes_on_heatmap
This analysis has shown Oklahoma that Oklahoma exhibits anomolous behaviour to the rest of the USA, meaning Oklahoma will be a focus of investigation throughout this project.
Following outputs from the RFM K-means clustering and spatio-temporal analysis, some interesting patterns have been observed. In particular, the spate of earthquakes in Oklahoma since 2010 means it is classified as a 'Very High Risk' county in terms of seismicity, according to the model.
The common scientific consensus is that the increased amount of earthquakes since 2011 is due to 'Underground Injection Control' (UIC). UIC is a process whereby fluid is injected into the ground at high volume and pressure. It is commonly used in wastewater disposal and fracking.
It was decided that the project would investigate the relationship between UIC Volume (the amount of fluid injected into the ground), UIC pressure(the pressure of fluid injection) and seismic activity in Oklahoma in the last decade.
To do this, we will use the earthquake data as well as new UIC data containing information of the volume and pressure of underground fluid injection, to investigate if there is any relationship between the two phenomena using linear regression.
Publically available UIC data was downloaded from the internet and stored in the folder below.
#view files in folder
datafolder = r'C:\Users\aczd067\OD\INM430-PriciplesOfDataScience\Fracking\Data\UIC-Volumes'
os.chdir(datafolder)
files = os.listdir(datafolder)
files
Each excel file contained annual data about every UIC drilling site in Oklahoma. Importantly, it contains data about the monthly fluid injection volumes and pressure. The code below extracts the monthly injection volumes at each site in 2017.
#read 2017 excel
uic2017 = pd.read_excel('UIC injection volumes 2017.xlsx',
sheet_name = 'Sheet1',
usecols = 'A:BN',
header = [0])
#drop duplicate drill observations
uic2017.drop_duplicates(subset = 'API', keep = 'first', inplace = True)
#keep only important descriptive columns
columns = ['API', 'Lat_Y', 'Long_X', 'CountyName']
#find columns related to 'volume', add to 'columns' list
volcols = [col for col in uic2017.columns if 'Vol' in col];
for name in volcols:
columns.append(name)
#new dataframe with only column names in 'columns' list
uic2017_Vol = uic2017[columns]
#add year to columns
for col in uic2017_Vol.columns.values:
if 'Vol' in col:
col1 = col.replace("Vol", "2017")
uic2017_Vol = uic2017_Vol.rename(columns = {col:col1})
uic2017_Vol.head()
A new dataframe, 'Annual Volumes', was made from the 2017 data. This will form the main dataframe for other years to be appended to.
#make this the main volumes dataframe for other years data to be appended to
Annual_volumes = uic2017_Vol
#remove 2017 file from file list
files.remove('UIC injection volumes 2017.xlsx')
The script below performs the same wrangling and processing as above for each remaining year in the folder, and merges into the main 'Annual_volumes' dataset.
#loop through list of files, wrangle each and add to 'Annual Volumes'
for file in files:
df = pd.read_excel(file,
sheet_name = 'Sheet1',
usecols = 'A:BN',
header = [0])
#drop duplicates
df.drop_duplicates(subset = 'API', keep = 'first', inplace = True)
#keep only important columns
columns = ['API', 'Lat_Y', 'Long_X', 'CountyName']
volcols = [col for col in df.columns if 'Vol' in col];
for name in volcols:
columns.append(name)
df_Vol = df[columns]
#add year to columns
for col in df_Vol.columns.values:
if 'Vol' in col:
col1 = col.replace("Vol", file[-9:-5])
df_Vol = df_Vol.rename(columns = {col:col1})
#merge each df to main 'Annual_volumes' df after each loop
Annual_volumes = pd.merge(Annual_volumes, df_Vol, how ='outer')
Annual_volumes.head()
Every column in the dataframe represents the monthly volumes of fluid injection by UIC sites in Oklahoma from 2011 to 2017.
These were made into datetime variables in the following steps.
Annual_volumes_dt = Annual_volumes
#string to datetime helper function
def dt_format(input_str, input_date_format, return_date_format):
return dt.datetime.strptime(input_str, input_date_format).strftime(return_date_format)
#make each month column name a datetime
for col in Annual_volumes_dt.columns.values[4:]:
col1 = (dt_format(col, '%b %Y', '%Y-%m'))
Annual_volumes_dt = Annual_volumes_dt.rename(columns = {col:col1})
Annual_volumes_dt.head()
The same process can now be carried out to find the pressure at which each UIC site injects fluid into the ground from 2011 to 2017.
#files
files = os.listdir(datafolder)
#read 2017 excel
uic2017 = pd.read_excel('UIC injection volumes 2017.xlsx',
sheet_name = 'Sheet1',
usecols = 'A:BN',
header = [0])
#drop duplicates
uic2017.drop_duplicates(subset = 'API', keep = 'first', inplace = True)
#keep only important descriptive columns
columns = ['API', 'Lat_Y', 'Long_X', 'CountyName', 'TotalDepth', 'FormationName']
#find columns related to 'PSI', add to 'columns' list
psicols = [col for col in uic2017.columns if 'PSI' in col];
for name in psicols:
columns.append(name)
#new dataframe with only column names in 'columns' list
uic2017_PSI = uic2017[columns]
#add year to columns
for col in uic2017_PSI.columns.values:
if 'PSI' in col:
col1 = col.replace("PSI", "2017")
uic2017_PSI = uic2017_PSI.rename(columns = {col:col1})
#make this the main pressure dataframe for other years data to be appended to
Annual_PSI = uic2017_PSI
#remove 2017 from files list
files.remove('UIC injection volumes 2017.xlsx')
#loop through list of files, perform above steps on each
for file in files:
df = pd.read_excel(file,
sheet_name = 'Sheet1',
usecols = 'A:BN',
header = [0])
#drop duplicates
df.drop_duplicates(subset = 'API', keep = 'first', inplace = True)
#keep only important columns
columns = ['API', 'Lat_Y', 'Long_X', 'CountyName']
psicols = [col for col in df.columns if 'PSI' in col];
for name in psicols:
columns.append(name)
df_PSI = df[columns]
#add year to columns
for col in df_PSI.columns.values:
if 'PSI' in col:
col1 = col.replace("PSI", file[-9:-5])
df_PSI = df_PSI.rename(columns = {col:col1})
#merge each df to 'base' df after each processing
Annual_PSI = pd.merge(Annual_PSI, df_PSI, how ='outer')
Annual_PSI_dt = Annual_PSI
#make each column name datetime
#helper function
def dt_format(input_str, input_date_format, return_date_format):
return dt.datetime.strptime(input_str, input_date_format).strftime(return_date_format)
for col in Annual_PSI_dt.columns.values[6:]:
col1 = (dt_format(col, '%b %Y', '%Y-%m'))
Annual_PSI_dt = Annual_PSI_dt.rename(columns = {col:col1})
Annual_PSI_dt.head()
This now gave two dataframes. One containing montly injection volumes, and one containing monthly injection pressure for each UIC site in Oklahoma from 2011 to 2017 for use in analysis.
The first step was to assess the monthly UIC volume and pressure in Oklahoma, and compare it to monthly seismic activity.
The individual UIC site volumes were aggregated to find monthly UIC volume in Oklahoma.
Annual_volumes_dt_t = Annual_volumes_dt.iloc[:,4:].T
Annual_volumes_dt_t = Annual_volumes_dt_t.aggregate(func = 'sum', axis= 1)
Annual_volumes_dt_t = pd.DataFrame(Annual_volumes_dt_t, columns = ['Total Volume'])
Annual_volumes_dt_t = Annual_volumes_dt_t.sort_index()
The individual UIC site pressure amounts were aggregated to find monthly UIC injection pressure in Oklahoma.
Annual_psi_dt_t = Annual_PSI_dt.iloc[:,6:].T
Annual_psi_dt_t = Annual_psi_dt_t.aggregate(func = 'sum', axis= 1)
Annual_psi_dt_t = pd.DataFrame(Annual_psi_dt_t, columns = ['Total PSI'])
Annual_psi_dt_t = Annual_psi_dt_t.sort_index()
Earthquakes in Oklahoma were retrived from the earthquake dataset, and then aggregated and resampled into monthly periods.
quakes_11_17 = quakes_sj[(quakes_sj['time'].dt.year >= 2011) & (quakes_sj['time'].dt.year <= 2017)]
quakes_11_17_ok = quakes_11_17[(quakes_11_17['state'] == 'Oklahoma')]
quakes_11_17_ok_m = quakes_11_17_ok.set_index('time')
quakes_11_17_ok_m = quakes_11_17_ok_m.resample("m").agg('sum')
Monthly UIC volume and monthly UIC pressure were plotted against monthly earthquakes in Oklahoma.
trace0 = Scatter(x = Annual_volumes_dt_t.index, y = Annual_volumes_dt_t['Total Volume'],mode = "lines",name = "UIC Volume",
marker = dict(color = 'rgba(0, 0, 255, 1)'), text= Annual_volumes_dt_t.index)
trace1 = Scatter(x = quakes_11_17_ok_m.index, y = quakes_11_17_ok_m['mag'],mode = "lines",name = "Total Magnitude", yaxis='y2',
marker = dict(color = 'rgba(255, 0, 0, 1)'), text= quakes_11_17_ok_m.index)
data = [trace0, trace1]
layout = Layout(title='Oklahoma UIC Volume and Magnitude (2011-2017)',
xaxis= dict(title= 'Year',
ticklen= 1,
zeroline= False),
yaxis=dict(title='Volume',
titlefont=dict(color='rgba(0, 0, 255, 0.5)'),
tickfont=dict(color='rgba(0, 0, 255, 0.5)'),
zeroline= False),
yaxis2=dict(title='Magnitude',
titlefont=dict(color='rgba(255, 0, 0, 0.5)'),
tickfont=dict(color='rgba(255, 0, 0, 0.5)'),
zeroline= False,
overlaying='y',
side='right'),
)
fig10 = dict(data = data, layout = layout)
iplot(fig10)
trace0 = Scatter(x = Annual_psi_dt_t.index, y = Annual_psi_dt_t['Total PSI'],mode = "lines",name = "UIC PSI",
marker = dict(color = 'rgba(225, 225, 0, 1)'), text= Annual_psi_dt_t.index)
trace1 = Scatter(x = quakes_11_17_ok_m.index, y = quakes_11_17_ok_m['mag'],mode = "lines",name = "Total Magnitude", yaxis='y2',
marker = dict(color = 'rgba(255, 0, 0, 1)'), text= quakes_11_17_ok_m.index)
data = [trace0, trace1]
layout = Layout(title='Oklahoma UIC Pressure and Magnitude (2011-2017)',
xaxis= dict(title= 'Year',
ticklen= 1,
zeroline= False),
yaxis=dict(title='Pressure',
titlefont=dict(color='rgba(225, 225, 0, 1)'),
tickfont=dict(color='rgba(225, 225, 0, 1)'),
zeroline= False),
yaxis2=dict(title='Magnitude',
titlefont=dict(color='rgba(255, 0, 0, 0.5)'),
tickfont=dict(color='rgba(255, 0, 0, 0.5)'),
zeroline= False,
overlaying='y',
side='right'),
)
fig11 = dict(data = data, layout = layout)
iplot(fig11)
It appears that there does seem to be some kind of relationship between both UIC injection volume and pressure, and earthquake acitivity.
Both total volume and total pressure increased in the period of 2011 to 2015, before generally declining post-2016.
Total monthly magnitude increases, but at a slower rate, until 2014 where activity increases exponentially. It begins to decline post-2016 in a similar way to UIC activity.
This seems to indicate that there is a relationship between UIC activity and seismic activity that is worth investigating.
Regression analysis is used to find the strength of a relationship between two variables. It is a suitable model to use for this project as we would like to find out if there is a conclusive relationship between underground fluid injection used in the fracking process, and seismic activity.
This was done by obtaining the total monthly volume and pressure totals for each county that experienced seismic activity, and performing regression.
First, earthquake data was transformed to find tptal earthquake magnitude for each county each month.
#get quakes per month in each county
quakes_11_17_ok_melt = quakes_11_17_ok.melt(id_vars = ['time', 'county'],value_vars = ['mag'], value_name = 'magnitude')
quakes_11_17_ok_melt['time'] = quakes_11_17_ok_melt['time'].map(lambda x: x.strftime('%Y-%m'))
quakes_11_17_ok_melt = quakes_11_17_ok_melt.set_index(['time', 'county'])
quakes_11_17_ok_melt = quakes_11_17_ok_melt.groupby(['time', 'county']).sum()
qua_ri = quakes_11_17_ok_melt.reset_index()
qua_ri['county'] = qua_ri['county'].apply(lambda x: x.upper())
qua_ri = qua_ri.set_index(['time', 'county'])
qua_ri.head()
This process was also done for monthly UIC volumes, as well as UIC pressure.
#plot monthly magnitude against volumes
#monthly volumes
volmelt = Annual_volumes_dt
volmelt1 = volmelt.melt(id_vars = ['API', 'Lat_Y', 'Long_X', 'CountyName'], value_name = 'Volume')
volmelt2 = volmelt1.set_index(['variable', 'CountyName'])
volmelt3 = volmelt2.groupby(level=[0,1]).sum()
vol_ri = volmelt3.reset_index()
vol_ri = vol_ri.rename(columns = {'variable':'time', 'CountyName':'county'})
vol_ri = vol_ri.set_index(['time', 'county'])
vol_ri.head()
#do the same for pressure
#plot monthly magnitude against pressure
#monthly volumes
psimelt = Annual_PSI_dt
psimelt1 = psimelt.melt(id_vars = ['API', 'Lat_Y', 'Long_X', 'CountyName', 'TotalDepth', 'FormationName'], value_name = 'PSI')
psimelt2 = psimelt1.set_index(['variable', 'CountyName'])
psimelt3 = psimelt2.groupby(level=[0,1]).sum()
psimelt3.head()
psi_ri = psimelt3.reset_index()
psi_ri = psi_ri.rename(columns = {'variable':'time', 'CountyName':'county'})
psi_ri = psi_ri.set_index(['time', 'county'])
#vol_ri = vol_ri.drop(['API', 'Lat_Y', 'Long_X'])
psi_ri.head()
An inner join between monthly volume and monthly magnitude, and monthly pressure and monthly volume meant that only relevant counties was be considered in analysis.
#merge volumes and quakes
volmag = pd.merge(vol_ri, qua_ri, left_index=True, right_index=True)
volmag.head()
#merge pressure and quakes
psimag = pd.merge(psi_ri, qua_ri, left_index=True, right_index=True)
psimag.head()
Each county was plotted on a scatter graph to assess the relationship between the variables.
trace1 = Scatter(x = volmag.Volume,
y = volmag.magnitude,
mode='markers',
marker=dict(size=8,
color = 'rgba(255, 0, 0, 0.3)'), #set color equal to a variable
)
layout = Layout(title= 'Volume & Magnitude per County (2011-2017)',
xaxis= dict(title= 'Magnitude',ticklen= 0.5, zeroline= True, gridwidth= 0.5),
yaxis=dict(title= 'Volume',ticklen= 0.5, gridwidth= 0.5))
data = [trace1]
fig000 = dict(data = data, layout = layout)
iplot(fig000)
trace1 = Scatter(x = psimag.PSI,
y = psimag.magnitude,
mode='markers',
marker=dict(size=8,
color = 'rgba(225, 225, 0, 0.3)'), #set color equal to a variable
)
layout = Layout(title= 'Pressure & Magnitude per County (2011-2017)',
xaxis= dict(title= 'Pressure',ticklen= 0.5, zeroline= True, gridwidth= 0.5),
yaxis=dict(title= 'Volume',ticklen= 0.5, gridwidth= 0.5))
data = [trace1]
fig333 = dict(data = data, layout = layout)
iplot(fig333)
It is clear that the data is skewed for both variables. Log normalisation was performed to fix this.
#log normalise vol
volmag_log = volmag.applymap(lambda x: np.log(x) if isinstance(x, int) else (np.log(x) if isinstance(x, float) else x))
trace1 = Scatter(x = volmag_log.Volume,
y = volmag_log.magnitude,
mode='markers',
marker=dict(size=8,
color = 'rgba(255, 0, 0, 0.3)'), #set color equal to a variable
)
layout = Layout(title= 'Normalised Volume & Magnitude per County (2011-2017)',
xaxis= dict(title= 'Volume',ticklen= 0.5, zeroline= True, gridwidth= 0.5),
yaxis=dict(title= 'Magnitude',ticklen= 0.5, gridwidth= 0.5))
data = [trace1]
fig2222 = dict(data = data, layout = layout)
iplot(fig2222)
#log normalise psi
psimag_log = psimag.applymap(lambda x: np.log(x) if isinstance(x, int) else (np.log(x) if isinstance(x, float) else x))
trace1 = Scatter(x = psimag_log.PSI,
y = psimag_log.magnitude,
mode='markers',
marker=dict(size=8,
color = 'rgba(225, 225, 0, 0.3)'), #set color equal to a variable
)
layout = Layout(title= 'Normalised Pressure & Magnitude per County (2011-2017)',
xaxis= dict(title= 'Pressure',ticklen= 0.5, zeroline= True, gridwidth= 0.5),
yaxis=dict(title= 'Magnitude',ticklen= 0.5, gridwidth= 0.5))
data = [trace1]
fig1111 = dict(data = data, layout = layout)
iplot(fig1111)
The data was then in a usable state for regression analysis to be performed.
Both Pearson and Spearman regression were carried out to see if there was any significant difference between the methods.
# First do a Pearson correlation
import scipy.stats as stats
corrPearson, pVal1 = stats.pearsonr(volmag_log.Volume, volmag_log.magnitude)
print("Correlation Pearson: ", corrPearson, pVal1)
# followed by a Spearman
corrSpearman, pVal2 = stats.spearmanr(volmag_log.Volume, volmag_log.magnitude)
print ("Correlation Spearman: ", corrSpearman, pVal2)
Pearson appeared to perform slighty better, likely because it is more suited to capturing linear relationships. Linear regression using the statsmodels package was carried out, and the key values returned.
slope, intercept, r_value, p_value, std_err = stats.linregress(volmag_log.Volume, volmag_log.magnitude)
evaluatedLine = np.polyval([slope, intercept], volmag_log.Volume)
print ("Slope: ", slope)
print ("Intercept: ", intercept)
print ("p_value: ", p_value)
print ("std_err: ", std_err)
print ("Rsq: ", r_value**2)
These were used to construct a regression line below.
line = slope*volmag_log.Volume+intercept
trace1 = Scatter(x = volmag_log.Volume,
y = volmag_log.magnitude,
mode='markers',
marker=dict(size=8, color = 'rgba(255, 0, 0, 0.3)')
)
trace2 = Scatter(x=volmag_log.Volume,
y=line,
mode='lines',
marker=Marker(color='rgb(0, 0, 0)'),
name='Regression Line')
layout = Layout(title= 'Volume & Magnitude Regression (2011-2017)',
xaxis= dict(title= 'Volume',ticklen= 0.5, zeroline= True, gridwidth= 0.5),
yaxis=dict(title= 'Magnitude',ticklen= 0.5, gridwidth= 0.5))
data = [trace1, trace2]
fig2222 = dict(data = data, layout = layout)
iplot(fig2222)
These outputs show that there is generally a positive correlation between the volume of fluid injected during the UIC process, and UIC activity. However, the relationship is quite weak, with a slope coefficient of 0.27. The R-square value is also small at 0.048.
It appears that volume alone is not a reliable predictor of seismic activity.
Cross validation was carried out to make sure the linear regression analysis is robust. A k-fold cross validation was carried out with k = 10 to reduce risk of overfitting.
#define variables as separate arrays
arr1 = np.array(volmag_log.Volume)
arr2 = np.array(volmag_log.magnitude)
#k-fold cross validation. developed from INM430 Practical 07
from sklearn import cross_validation
numberOfSamples = len(arr1)
evaluatedLine = np.polyval([slope, intercept], volmag_log.Volume)
plt.scatter(volmag_log.Volume, volmag_log.magnitude, alpha=0.3, color = 'r')
plt.plot(volmag_log.Volume, evaluatedLine, 'k-')
kf = cross_validation.KFold(numberOfSamples, n_folds=10)
foldCount = 0
for train_index, test_index in kf:
print("--RUN NUMBER: ", foldCount,"--")
arraySubset1 = arr1[train_index]
arraySubset2 = arr2[train_index]
slope, intercept, r_value, p_value, std_err = stats.linregress(arraySubset1, arraySubset2)
print("\n Slope: ", slope, "\n Intercept: ", intercept, "\n r_sq: ", r_value**2, "\n p_value: ", p_value,"\n std_err: ", std_err)
print(" ")
xp = np.linspace(arr1.min(), arr1.max(), 100)
evaluatedLine = np.polyval([slope, intercept], xp)
# Overlay line on graph for each run
plt.plot(xp, evaluatedLine, 'k--', linewidth = 1, alpha = 0.3)
plt.title("Volume and Magnitide Cross Validation")
plt.xlabel("Volume")
plt.ylabel("Magnitude")
foldCount += 1
The outputs of the runs were generally consistent each time, but showed signs of variability, ranging from 0.24 to 0.31. Furthermore, there was not much variability in the r-squared values, as these continued to perform poorly.
Model testing was then carried out. Here, unseen data points were inserted to the regression model, and the estimated values to the actaual values were compared to see how much they vary.
for train_index, test_index in kf:
print("-------- We are having the run: ", foldCount )
arraySubset1 = arr1[train_index]
arraySubset2 = arr2[train_index]
unseenSubset1 = arr1[test_index]
unseenSubset2 = arr2[test_index]
slope, intercept, r_value, p_value, std_err = stats.linregress(arraySubset1, arraySubset2)
# Use the regression models to estimate the unseen "medIncome" values
# given their ViolentCrimesPerPop values.
estimatedValues = slope * unseenSubset1 + intercept
# check the differences between the estimates and the real values
differences = unseenSubset2 - estimatedValues
print (np.average(differences))
foldCount += 1
The model showed some variability again, with errors as high as as low as, -0.08, but as high as -1.12. Despit the variability, these were generally small with respect to the scale of the magnitude variable.
The same regression analysis process was carried out between the pressure and magniude variables.
# First do a Pearson correlation
corrPearson, pVal1 = stats.pearsonr(psimag_log.PSI, psimag_log.magnitude)
print("Correlation Pearson: ", corrPearson, pVal1)
# followed by a Spearman
corrSpearman, pVal2 = stats.spearmanr(psimag_log.PSI, psimag_log.magnitude)
print ("Correlation Spearman: ", corrSpearman, pVal2)
This time, Spearman correlation appears to perform slighty better, likely because it is more suited to non-linear relationships. Linear regression using the statsmodels package was carried out, and the key values returned.
slope, intercept, r_value, p_value, std_err = stats.linregress(psimag_log.PSI, psimag_log.magnitude)
print ("Slope: ", slope)
print ("Intercept: ", intercept)
print ("p_value: ", p_value)
print ("std_err: ", std_err)
print ("Rsq: ", r_value**2)
It appears there is more of a negative correlation between fluid injection pressure and seismic activity. This could be expected, and may indicate that indicate UIC sites would be unlikely to inject fluid at high pressure for high volumes.
However, again there was generally a poor r-squared result of 0.05 and high p-value. This again could indicate that UIC injection pressure, like volume, is not a good predictor of seismic activity.
line = slope*psimag_log.PSI+intercept
trace1 = Scatter(x = psimag_log.PSI,
y = psimag_log.magnitude,
mode='markers',
marker=dict(size=8, color = 'rgba(225, 225, 0, 0.3)')
)
trace2 = Scatter(x=psimag_log.PSI,
y=line,
mode='lines',
marker=Marker(color='rgb(0, 0, 0)'),
name='Regression Line')
layout = Layout(title= 'Pressure & Magnitude Regression (2011-2017)',
xaxis= dict(title= 'Pressure',ticklen= 0.5, zeroline= True, gridwidth= 0.5),
yaxis=dict(title= 'Magnitude',ticklen= 0.5, gridwidth= 0.5))
data = [trace1, trace2]
fig2222 = dict(data = data, layout = layout)
iplot(fig2222)
#define variables as separate arrays
arr3 = np.array(psimag_log.PSI)
arr4 = np.array(psimag_log.magnitude)
#k-fold cross validation. developed from INM430 Practical 07
from sklearn import cross_validation
numberOfSamples = len(arr3)
evaluatedLine = np.polyval([slope, intercept], psimag_log.PSI)
plt.scatter(psimag_log.PSI, psimag_log.magnitude, alpha=0.3, color = 'Yellow')
plt.plot(psimag_log.PSI, evaluatedLine, 'k-')
kf = cross_validation.KFold(numberOfSamples, n_folds=10)
foldCount = 0
for train_index, test_index in kf:
print("--RUN NUMBER: ", foldCount,"--")
arraySubset3 = arr3[train_index]
arraySubset4 = arr4[train_index]
slope, intercept, r_value, p_value, std_err = stats.linregress(arraySubset3, arraySubset4)
print("\n Slope: ", slope, "\n Intercept: ", intercept, "\n r_sq: ", r_value**2, "\n p_value: ", p_value,"\n std_err: ", std_err)
print(" ")
xp = np.linspace(arr3.min(), arr3.max(), 100)
evaluatedLine = np.polyval([slope, intercept], xp)
# Overlay line on graph for each run
plt.plot(xp, evaluatedLine, 'k--', linewidth = 1, alpha = 0.3)
plt.title("Pressure and Magnitide Cross Validation")
plt.xlabel("Pressure")
plt.ylabel("Magnitude")
foldCount += 1
The outputs were generally consistent for each run, but showed signs of variability, ranging from -0.302 to -0.398. Furthermore, there was not much variability in the r-squared values, as these continued to perform poorly.
Model testing was then carried out.
for train_index, test_index in kf:
print("-------- We are having the run: ", foldCount )
arraySubset3 = arr3[train_index]
arraySubset4 = arr4[train_index]
unseenSubset3 = arr3[test_index]
unseenSubset4 = arr4[test_index]
slope, intercept, r_value, p_value, std_err = stats.linregress(arraySubset3, arraySubset4)
# Use the regression models to estimate the unseen "medIncome" values
# given their ViolentCrimesPerPop values.
estimatedValues = slope * unseenSubset3 + intercept
# check the differences between the estimates and the real values
differences = unseenSubset4 - estimatedValues
print (np.average(differences))
foldCount += 1
Model testing showed variability in predicting magnitude output. Although the errors were not very large relative to the predictor magnitude scale, the variability does indicate a non-robust model. It can be determined therefore that UIC injection pressure is not a string predictor of seimic activity alone.
Multiple regression is used to when we want to predict the value of a variable based on the value of two or more variables. In this case, the dependant variable is the sum monthly magnitude per county, and the independent variables are total UIC injection volume for that month, and total UIC injection pressure for that month.
import statsmodels.api as sm
regressors = np.matrix([volmag.Volume, psimag.PSI])
regressors
The regressors were transposed so that there are equal numer of rows in each.
#transpose regressors, check shape
regressors_t = np.transpose(regressors)
print("Regressor shape: ",regressors_t.shape)
#define magnitude array, check shape
mag_arr = np.array(volmag.magnitude)
print("Magnitude (predictor) shape: ", mag_arr.shape)
import statsmodels.api as sm
r_squ= []
numberOfSamples = len(mag_arr)
kf = cross_validation.KFold(numberOfSamples, n_folds=5)
foldCount = 0
for train_index, test_index in kf:
print("--RUN NUMBER: ", foldCount,"--")
arraySubset5 = mag_arr[train_index]
arraySubset6 = regressors_t[train_index]
regressors_t = sm.add_constant(regressors_t)
model_m = sm.OLS(arraySubset5, arraySubset6, hasconst=False)
results_m = model_m.fit()
print("RUN SUMMARY: ", results_m.summary())
r_squ.append(results_m.rsquared)
foldCount += 1
average_rsq = np.array(r_squ).mean()
print("Average R squared: ", average_rsq)
The cross-validation of the multiple regression model using 5 folds shows that there is an average R-squared score of 0.315.
The output from the multiple regression shows that a regression model using both UIC injection pressure and UIC injection volume does not provide a more robust predictor of seismic activity in Oklahoma than simply using single linear regression models using UIC Injection volume and UIC injection pressure alone.
Twitter provides a rich information source for social sciences. Everyday, millions of tweets are posted by people giving opinions on a plethora of topics. Since fracking has become more of a common practice since in the 21st century, there have been a wide range of opinions about it. In the USA, there has been wide criticism about the practice over the last decade. In the UK, anti-fracking protest groups have been on the rise against the practice.
The final research question in this project aims to see whether attitudes towards fracking can be analysed using data mined from Twitter.
Tweets were retrived using the Twitter API.
Firstly, some important additional dependancies were imported.
#import dependancies
import tweepy
from textblob import TextBlob
import csv
#set working directory#
os.chdir(r'C:\Users\aczd067\OD\INM430-PriciplesOfDataScience\Fracking\Data\Twitter')
The Twitter Developer API credentials were defined below.
####input your credentials here
consumer_key = 'xBu0XgP3EM2NmQP5UcBNVnuTs'
consumer_secret = 'E4ebWcpnbfV5RRJA0KgJd09MJBlV8rdcyTrQJBkoAfJ7BvWK8L'
access_token = '1069582682517135360-wjroF2DkLqvadKvJ4jFFS7Iclb9PtF'
access_token_secret = '0ddS9LVeGNCmtPz7bGbBsFzMN3zDLlHbAgKhiwWEcBDOI'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth,wait_on_rate_limit=True)
The script below scrapes twitter data for tweets containing a sepecific string. In this case, we will find all tweets in the last week related to fracking, by retrieving all tweets with the hashtag '#fracking'.
During this process, a useful library, 'TextBlob' is used to score the sentiment for each tweet. Namely, 'Polarity' (How positive or negative the sentiment is) and 'Subjectivity' (how opinionated or factual the tweet is).
*To eliminate overwriting of the dataset, the code to do this has been commented out.
# Open/Create a file to append data
#csvFile = open("tweets_final.csv", 'a')
#Use csv Writer
#csvWriter = csv.writer(csvFile)
#for tweet in tweepy.Cursor(api.search,q="#fracking",count=1000,
# lang="en",
# since="2017-04-03").items():
# analysis = TextBlob(tweet.text)
# print (tweet.created_at, tweet.text, analysis.sentiment.polarity, analysis.sentiment.subjectivity)
# csvWriter.writerow([tweet.created_at, tweet.text.encode('utf-8'), analysis.sentiment.polarity, analysis.sentiment.subjectivity])
frack_tweets = pd.read_csv('tweets_final.csv', names = ['date', 'tweet', 'sentiment_polarity', 'sentiment_subjectivity'])
frack_tweets.head()
frack_tweets.info()
We can see that 8,862 tweets were retrived from the search for a one-week period. All these tweets have information on the date, the tweet and the polarity/subjectivity score.
Lot's of the tweets show both 0 subjectivity and 0 polarity. These indicate that the algorithm couldnt assign a good score to them. These were dropped from the analysis.
tweets = frack_tweets[(frack_tweets['sentiment_polarity'] != 0 ) & (frack_tweets['sentiment_subjectivity'] != 0)]
len(tweets)
The removal of these tweets meant that 5,660 tweets containing the hashtag '#fracking' remained for analysis.
The number of positive and negative tweetd can be seen below.
print("Number of positive tweets: ",len(tweets[tweets['sentiment_polarity']>0]))
print("Number of negative tweets: ",len(tweets[tweets['sentiment_polarity']<0]))
data = [Bar(x=['Negative', 'Positive'],
y=[len(tweets[tweets['sentiment_polarity']<0]), len(tweets[tweets['sentiment_polarity']>0])],
marker=dict(color=['rgba(222,45,38,0.6)', 'rgba(0,128,0,0.6)'])
)]
layout = Layout(title='Fracking Tweet Sentiment',
)
fig991 = dict(data = data, layout = layout)
iplot(fig991)
Interestingly, it appears that the majority of tweets are positive towards fracking, with only 1,213 of 5,660 tweets negative towards fracking.
The individual tweets were visualised on the plot below for further analysis.
trace1 = Scatter(x = tweets.sentiment_polarity,
y = tweets.sentiment_subjectivity,
mode='markers',
marker=dict(size=8,
color = tweets.sentiment_polarity, #set color equal to a variable
colorscale=[[0.0, 'rgba(255, 0, 0, 0.3)'],
[0.5, 'rgba(214, 214, 214, 0.3)'],
[1.0, 'rgba(0,128,0, 0.3)']],
showscale=True),
text = tweets['tweet']
)
layout = Layout(title= 'Tweet Polarity & Subjectivity',
hovermode= 'closest',
xaxis= dict(title= 'Tweet Polarity',ticklen= 0.5, zeroline= True, gridwidth= 0.5),
yaxis=dict(title= 'Tweet Subjectivity',ticklen= 0.5, gridwidth= 0.5))
data = [trace1]
fig92 = dict(data = data, layout = layout)
iplot(fig92)
Here we can take a closer look at tweet sentiment and subjectivity. Again thr graph illustrates that more tweets are positive about fracking. Interestingly, the as tweets become more subjective and opinionated, they become more polarised. Meaning that people tend to have strong opinions about fracking.
By hovering over each of tweets allows the tweet to be viewed. Further inspection shows that the results may not be reliable. For instance, the most positive tweet in the top right of the plot is quite obviously negative about fracking, but the obvious sarcasm is not picked up by the TextBlob algorithm.
These instances are evident numerous times acriss the plot. It may be valid to say that sentiment analysis using TextBlob is not a robust method for measuring public opinion about fracking.